@burqen
Contributor

@burqen burqen commented Nov 12, 2025

Previously both DeduplicatingFieldInfosFormat and TSDBSyntheticIdCodec.RewriteFieldInfosFormat would iterate over FieldInfos.

This is now optimised by letting TSDBSyntheticIdCodec.RewriteFieldInfosFormat extend DeduplicatingFieldInfosFormat in order to let RewriteFieldInfosFormat utilise the iteration done by DeduplicatingFieldInfosFormat.

Also let TSDBSyntheticIdCodec extend DeduplicateFieldInfosCodec so that we can simplify the codec wrapping to always use only one of them.

Additional changes:

  • Elasticsearch versioned codecs are always wrapped by DeduplicateFieldInfosCodec, instead of extending it. This makes it possible for TSDBSyntheticIdCodec to extend DeduplicateFieldInfosCodec and, hopefully as a side effect, makes the codec setup easier to reason about.
  • Move DecouplingFieldInfosCodec to upper level.
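
For readers unfamiliar with the classes involved, the single-loop idea can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `SketchFieldInfo`, `DedupFormatSketch`, `RewriteFormatSketch` and the `synthetic_id` attribute are hypothetical stand-ins, and the real formats work on Lucene's `FieldInfos`. The base class owns the only loop and calls a per-field hook, which the subclass overrides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SingleLoopSketch {
    public static void main(String[] args) {
        RewriteFormatSketch format = new RewriteFormatSketch();
        List<SketchFieldInfo> out = format.read(List.of(
            new SketchFieldInfo("_id", Map.of("k", "v")),
            new SketchFieldInfo("host", Map.of("k", "v"))
        ));
        System.out.println(out + " visited=" + format.fieldsVisited);
    }
}

// Hypothetical stand-in for Lucene's FieldInfo: a name plus an attribute map.
record SketchFieldInfo(String name, Map<String, String> attributes) {}

// Stand-in for DeduplicatingFieldInfosFormat: owns the only loop over the fields.
class DedupFormatSketch {
    private final Map<Map<String, String>, Map<String, String>> canonical = new HashMap<>();
    int fieldsVisited = 0; // counter for illustration only

    final List<SketchFieldInfo> read(List<SketchFieldInfo> raw) {
        List<SketchFieldInfo> result = new ArrayList<>(raw.size());
        for (SketchFieldInfo fi : raw) {
            fieldsVisited++;
            // Deduplication: equal attribute maps end up sharing one instance.
            Map<String, String> shared = canonical.computeIfAbsent(fi.attributes(), k -> k);
            result.add(processFieldInfo(new SketchFieldInfo(fi.name(), shared)));
        }
        return result;
    }

    // Hook called once per field from inside the loop; a no-op by default.
    protected SketchFieldInfo processFieldInfo(SketchFieldInfo fi) {
        return fi;
    }
}

// Stand-in for TSDBSyntheticIdCodec.RewriteFieldInfosFormat: overrides the hook only.
class RewriteFormatSketch extends DedupFormatSketch {
    @Override
    protected SketchFieldInfo processFieldInfo(SketchFieldInfo fi) {
        if ("_id".equals(fi.name()) == false) {
            return fi;
        }
        Map<String, String> attrs = new HashMap<>(fi.attributes());
        attrs.put("synthetic_id", "true"); // made-up attribute, for illustration
        return new SketchFieldInfo(fi.name(), attrs);
    }
}
```

With this shape the subclass never iterates on its own, so each field is visited once per read.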

No codec should extend DeduplicatingFieldInfoCodec, instead always wrap
it.
@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@elasticsearchmachine
Collaborator

Hi @burqen, I've created a changelog YAML for you.

@burqen burqen changed the title Single loop for FielfInfo processing Single loop for FieldfInfo processing Nov 12, 2025
@burqen burqen requested review from fcofdez and tlrx November 12, 2025 14:43
Member

@tlrx tlrx left a comment

LGTM

return codec;
String name = e.getValue().getName();
Codec codec = e.getValue();
return useTsdbSyntheticId ? new TSDBSyntheticIdCodec(name, codec) : new DeduplicateFieldInfosCodec(name, codec);
Member

nit:

Suggested change
return useTsdbSyntheticId ? new TSDBSyntheticIdCodec(name, codec) : new DeduplicateFieldInfosCodec(name, codec);
return useTsdbSyntheticId ? new TSDBSyntheticIdCodec(codec) : new DeduplicateFieldInfosCodec(codec);

Comment on lines 22 to 25
protected DeduplicateFieldInfosCodec(String name, Codec delegate) {
super(name, delegate);
this.fieldInfosFormat = createFieldInfosFormat(delegate.fieldInfosFormat());
}
Member

Suggested change
protected DeduplicateFieldInfosCodec(String name, Codec delegate) {
super(name, delegate);
this.fieldInfosFormat = createFieldInfosFormat(delegate.fieldInfosFormat());
}
protected DeduplicateFieldInfosCodec(Codec delegate) {
super(delegate.getName(), delegate);
this.fieldInfosFormat = createFieldInfosFormat(delegate.fieldInfosFormat());
}


protected void validateFieldInfos(FieldInfos fieldInfos) {}

protected FieldInfo processFieldInfo(FieldInfo fi) {
Member

nit:

Suggested change
protected FieldInfo processFieldInfo(FieldInfo fi) {
protected FieldInfo wrapFieldInfo(FieldInfo fi) {


private static Codec unwrappedCodec(CodecService codecService, String codecName) {
Codec codec = codecService.codec(codecName);
if (codec instanceof DeduplicateFieldInfosCodec deduplicatingCodec) {
Member

I think this can be:

Suggested change
if (codec instanceof DeduplicateFieldInfosCodec deduplicatingCodec) {
if (codec instanceof FilterCodec filtered) {

Contributor Author

Unfortunately not, the delegate is hidden from us there.

Contributor

@fcofdez fcofdez left a comment

I'm afraid that we should revert the change around not extending DeduplicateFieldInfosCodec in the default codecs. The reason is that Lucene would use SPI to load the Codec and it will just instantiate the codec with the no-args constructor and thus we won't get to deduplicate the fields. This only applies when a node is restarted for example and we need to read the codec from the SegmentInfo and I guess that it applies to search nodes in serverless too.
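
To make the concern concrete, here is a minimal simulation of it. Everything in it is a hypothetical stand-in: `SpiRegistrySketch` mimics a lookup by codec name that can only call a no-args constructor, and `DedupWrapperSketch` mimics wrapping a codec at write time. It does not use Lucene's actual SPI machinery.

```java
import java.util.Map;
import java.util.function.Supplier;

class SpiLoadingSketch {
    public static void main(String[] args) {
        SpiRegistrySketch spi = new SpiRegistrySketch();
        // Write path: the codec is wrapped explicitly, so deduplication is active.
        CodecSketch writeCodec = new DedupWrapperSketch(spi.forName("ES"));
        // Only the codec name is recorded with the segment.
        String nameInSegment = writeCodec.name;
        // Read path after a restart: the name is resolved again, without the wrapper.
        CodecSketch readCodec = spi.forName(nameInSegment);
        System.out.println("write path deduplicates: " + writeCodec.deduplicatesFieldInfos());
        System.out.println("read path deduplicates: " + readCodec.deduplicatesFieldInfos());
    }
}

// Hypothetical stand-in for a codec that does not deduplicate on its own.
class CodecSketch {
    final String name;

    CodecSketch(String name) {
        this.name = name;
    }

    boolean deduplicatesFieldInfos() {
        return false;
    }
}

// Stand-in for a wrapping codec that keeps the delegate's name.
class DedupWrapperSketch extends CodecSketch {
    final CodecSketch delegate;

    DedupWrapperSketch(CodecSketch delegate) {
        super(delegate.name);
        this.delegate = delegate;
    }

    @Override
    boolean deduplicatesFieldInfos() {
        return true;
    }
}

// Stand-in for a name-based service lookup: it only knows no-args factories.
class SpiRegistrySketch {
    private final Map<String, Supplier<CodecSketch>> providers = Map.of("ES", () -> new CodecSketch("ES"));

    CodecSketch forName(String name) {
        return providers.get(name).get();
    }
}
```

Because the wrapper shares the delegate's name, the lookup on the read path returns the plain codec, which is why extending (rather than wrapping) keeps deduplication for service-loaded codecs.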

@burqen
Contributor Author

burqen commented Nov 13, 2025

> I'm afraid that we should revert the change around not extending DeduplicateFieldInfosCodec in the default codecs. The reason is that Lucene would use SPI to load the Codec and it will just instantiate the codec with the no-args constructor and thus we won't get to deduplicate the fields. This only applies when a node is restarted for example and we need to read the codec from the SegmentInfo and I guess that it applies to search nodes in serverless too.

I can see that Lucene is used to load the codecs here, but they will all be wrapped afterwards. Is there some other place as well where the codecs are service loaded outside of CodecService? https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/index/codec/CodecService.java#L66-L69

@fcofdez
Contributor

fcofdez commented Nov 13, 2025

> Is there some other place as well where the codecs are service loaded outside of CodecService?

Yes, in Lucene, when a commit is read from disk (see https://github.com/apache/lucene/blob/e02bdb4c3c547488342b423e1b9b2b25519bd427/lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java#L409-L412). We implicitly rely on being able to read a segment written with a particular codec using the SPI-loaded codec.

@burqen
Contributor Author

burqen commented Nov 13, 2025

> > Is there some other place as well where the codecs are service loaded outside of CodecService?

> Yes, in Lucene, when a commit is read from disk (see https://github.com/apache/lucene/blob/e02bdb4c3c547488342b423e1b9b2b25519bd427/lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java#L409-L412). We implicitly rely on being able to read a segment written with a particular codec using the SPI-loaded codec.

Got it! Thanks for pointing that out. I'll revert and find some other approach.

@martijnvg
Member

Good observation @fcofdez, this indeed makes the wrapping approach ineffective. But it should still be possible to loop only once over the field infos in TSDBSyntheticIdCodec if we extend DeduplicateFieldInfosCodec?

@burqen
Contributor Author

burqen commented Nov 14, 2025

@martijnvg The problem is that the Elasticsearch*Codecs need to inherit from DeduplicateFieldInfosCodec (if we want the service-loaded codecs to benefit from deduplication, which I assume we do). Even if TSDBSyntheticIdCodec also inherits from DeduplicateFieldInfosCodec, it would still need to wrap the original Elasticsearch*Codec, which is itself a DeduplicateFieldInfosCodec. So we would still iterate over FieldInfos twice (and even do the deduplication work twice).
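
The double iteration can be illustrated with a toy model. The types here (`FormatSketch`, `RawFormatSketch`, `LoopingFormatSketch`) are hypothetical and only count visits; they assume, as described above, that each deduplicating layer loops over whatever its delegate returns.

```java
import java.util.ArrayList;
import java.util.List;

class NestedWrappingSketch {
    public static void main(String[] args) {
        int[] visits = new int[1];
        // Inner layer: the Elasticsearch*Codec's own deduplicating format.
        FormatSketch inner = new LoopingFormatSketch(new RawFormatSketch(), visits);
        // Outer layer: a synthetic id format that is also a deduplicating format.
        FormatSketch outer = new LoopingFormatSketch(inner, visits);
        outer.read();
        System.out.println("field visits for 3 fields: " + visits[0]);
    }
}

// Hypothetical stand-in for a FieldInfosFormat; fields are just names here.
interface FormatSketch {
    List<String> read();
}

// The format that actually reads field infos from disk (faked with a fixed list).
class RawFormatSketch implements FormatSketch {
    @Override
    public List<String> read() {
        return List.of("_id", "@timestamp", "host");
    }
}

// Any layer that processes fields by looping over its delegate's result.
class LoopingFormatSketch implements FormatSketch {
    private final FormatSketch delegate;
    private final int[] visits; // shared counter, for illustration only

    LoopingFormatSketch(FormatSketch delegate, int[] visits) {
        this.delegate = delegate;
        this.visits = visits;
    }

    @Override
    public List<String> read() {
        List<String> out = new ArrayList<>();
        for (String field : delegate.read()) {
            visits[0]++; // one visit per field, per layer
            out.add(field);
        }
        return out;
    }
}
```

Each extra looping layer adds one full pass, so two nested deduplicating formats visit every field twice regardless of how the outer one is declared.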

I tried a more brute force approach

  1. Inline the synthetic id functionality in DeduplicatingFieldInfosFormat, protected behind useTsdbSyntheticId flag
    @Override
    public FieldInfos read(Directory directory, SegmentInfo segmentInfo, String segmentSuffix, IOContext iocontext) throws IOException {
        final FieldInfos fieldInfos = delegate.read(directory, segmentInfo, segmentSuffix, iocontext);
        if (useTsdbSyntheticId) {
            TSDBSyntheticIdCodec.ensureSyntheticIdFields(fieldInfos);
        }
        final FieldInfo[] deduplicated = new FieldInfo[fieldInfos.size()];
        int i = 0;
        for (FieldInfo fi : fieldInfos) {
            // ... The deduplication stuff
            if (useTsdbSyntheticId) {
                fieldInfo = TSDBSyntheticIdCodec.wrapFieldInfoOnRead(fieldInfo);
            }
            deduplicated[i++] = fieldInfo;
        }
        return new FieldInfosWithUsages(deduplicated);
    }
  2. Set the flag from CodecService by propagating it through the codec
    if (useTsdbSyntheticId) {
        ((DeduplicateFieldInfosCodec) codec).useTsdbSyntheticId();
        codec = new TSDBSyntheticIdCodec(codec.getName(), codec); // Still wrap for PostingsFormat, more on that later
    }
  3. Profit?
    ... no. This makes the useTsdbSyntheticId state part of the codecs, which are also service loaded in other places such as IndexDiskUsageAnalyzer. The analyzer becomes very confused when the _id field suddenly appears to have postings (because it reads the FieldInfo through DeduplicatingFieldInfosFormat, which pretends on the read path that the postings exist).
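
The leak described in step 3 can be shown with a small sketch. It assumes that a name-based lookup hands out one shared codec instance to every caller; `SharedRegistrySketch` and `FlaggedCodecSketch` are hypothetical stand-ins, not Elasticsearch or Lucene classes.

```java
class SharedFlagSketch {
    public static void main(String[] args) {
        SharedRegistrySketch registry = new SharedRegistrySketch();
        // An unrelated consumer (think: a disk usage analyzer) looks the codec up.
        FlaggedCodecSketch seenByAnalyzer = registry.lookup();
        System.out.println("before: " + seenByAnalyzer.idFieldAppearsToHavePostings());
        // Elsewhere, the flag is set for an index that uses synthetic ids.
        registry.lookup().useTsdbSyntheticId();
        // The unrelated consumer now sees the changed behavior as well.
        System.out.println("after: " + seenByAnalyzer.idFieldAppearsToHavePostings());
    }
}

// Stand-in for a codec carrying the mutable flag from step 2.
class FlaggedCodecSketch {
    private boolean useTsdbSyntheticId;

    void useTsdbSyntheticId() {
        this.useTsdbSyntheticId = true;
    }

    // On the read path, the flag makes the _id field look as if it had postings.
    boolean idFieldAppearsToHavePostings() {
        return useTsdbSyntheticId;
    }
}

// Assumption for this sketch: every lookup returns the same instance.
class SharedRegistrySketch {
    private final FlaggedCodecSketch shared = new FlaggedCodecSketch();

    FlaggedCodecSketch lookup() {
        return shared;
    }
}
```

Mutable, index-specific state on a shared instance is visible to every other user of that instance, which matches the confusion seen in the analyzer.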

This makes me think it's a bad idea to make the deduplicating codec aware of synthetic id altogether. But I cannot figure out any other way to get rid of the extra loop. All my ideas center, in one way or another, on injecting functionality into DeduplicateFieldInfosCodec, and more specifically into the FieldInfosFormat that it provides.

This brings me back to your comment, @fcofdez. How important is it that the service-loaded codecs benefit from the deduplication? They will still be loaded, only they will not be wrapped with the deduplication. How does that weigh against the optimization that we are trying to do here?

Do you have any ideas? @fcofdez , @tlrx , @martijnvg ?

About PostingsFormat:
The Elasticsearch*Codecs define their own PostingsFormat. I tried to wrap those from inside DeduplicateFieldInfosCodec based on the useTsdbSyntheticId flag, but that seemed both hacky and error-prone, so I stopped.
